IODS-project

This is a project related to the MOOC called IODS, which is hosted by Kimmo Vehkalahti.

The objective of this course is to get familiar with R, RMarkdown and GitHub and to be able to use them in, for example, statistical analysis. The course lasts for 7 weeks. Link to my github repository: https://github.com/vviljo/IODS-project

More information about this course can be found here (MOOC.helsinki) and here (blogs.helsinki.fi).

See also this useful R Markdown cheatsheet


2. Regression and model validation

This chapter consists of exercises related to data wrangling and analysis. However, the code for data wrangling is not presented here, as instructed. After some housekeeping, the exercise is divided into five parts. The contents of each part are introduced at its beginning.

I start with some housekeeping and by loading the packages used later in the exercise.

rm(list=ls()) 
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(GGally)
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa

Part 1: Importing data + brief overview

Here I start by importing the data that I produced in the data wrangling section. Studying the data with the functions dim() and str() reveals that the dataset includes 166 rows (observations) and 7 columns (variables). The data frame includes the variables “gender” (factor), “age” (int), “attitude” (int), “deep” (num), “stra” (num), “surf” (num) and “points” (int); the labels in brackets describe each variable’s type.

#Read data from the source mentioned above and name it "students2014"
students2014 <- read.table("learning2014.txt", sep=",")
#dimensions of the data
dim(students2014)
## [1] 166   7
#dataset includes 166 rows (observations) and 7 columns (variables)
# look at the structure of the data
str(students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
#the data frame includes variables "gender" (factor), "age" (int), "attitude"(int),"deep"(num),"stra"(num),"surf"(num) and "points"(int)

Part 2: More detailed description about the data

This dataset is based on http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt. For more detailed information about the data and the variables in the original dataset used for data wrangling, please see http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-meta.txt. The data was originally collected with a questionnaire on teaching and learning, as part of an international research project made possible by Opettajien akatemia (the Teachers’ Academy). After the data wrangling, the dataset used in the analysis (“students2014”) includes two background variables (“age”, “gender”), variables for exam points (“points”) and attitude (“attitude”), and three variables that are indices derived from the original questionnaire, related to learning approaches (“deep”, “stra”, “surf”).

Graphical overview and summary of the data:

# initialize plot with data and aesthetic mapping
p1 <- ggplot(students2014, aes(x = attitude, y = points, col=gender))

# define the visualization type (points)
p2 <- p1 + geom_point()

# add a regression line
p3 <- p2 + geom_smooth(method = "lm")

# add a main title and draw the plot
p4 <- p3+ggtitle("Student's attitude versus exam points")
p4

The graph “Student’s attitude versus exam points” suggests a positive correlation between students’ attitude and exam points. This does not imply causality or statistical significance, and the correlation would be fairly difficult to perceive without the regression lines.

# create a more advanced plot matrix with ggpairs()
p <- ggpairs(students2014, mapping = aes(col=gender,alpha=0.3), lower = list(combo = wrap("facethist", bins = 20))) +ggtitle("Relationships between variables")

# draw the plot
p

A more detailed view is presented in the graph “Relationships between variables”, which gives information about the distributions of the variables and how they correlate with each other. For example, from this graph we learn that exam points have the highest correlation with attitude and the lowest with the variable “deep” (in absolute value). The graph also shows that the sample includes many more women than men; more detailed differences between men and women in this sample can be seen in the first row.

#summary of variables
summary(students2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :14.00   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :32.00   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :31.43   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :50.00   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

Studying the output of the summary() function, we see the exact numbers of women (110) and men (56) in the sample, and we learn about the distributions of the variables. For example, the average age in the sample is 25.51, while the youngest person in the sample was 17 and the oldest was 55.

Part 3: linear model: choosing variables, interpretation and p-values

In this part we study our linear model with exam points (“points”) as the dependent variable. In the first model (“my_model”), the explanatory variables are “attitude”, “stra” and “surf”. These variables are chosen as they have the highest correlations (in absolute value) with the dependent variable.

# create a regression model with multiple explanatory variables
# chosen explanatory variables for the first model are "attitude", "stra" and "surf",
# as they have the highest correlations with our dependent variable "points" in absolute value
my_model <- lm(points ~ attitude + stra + surf, data = students2014)

# print out a summary of the model
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The summary of our linear model suggests that only the intercept and the variable “attitude” are statistically significant (with p < 0.01) in explaining “points”.

Interpretation of coefficients

The coefficients are interpreted such that a one-unit change in an explanatory variable is associated with a change of the size of its coefficient in the dependent variable (“points”). For example, a one-unit increase in “attitude” is associated with an increase of approximately 0.34 in exam points. Similarly, a one-unit increase in “stra” is associated with a 0.85-point increase, and a one-unit increase in “surf” with a 0.59-point decrease in exam points.
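This interpretation can be checked directly: for a linear model, raising one predictor by one unit while holding the others fixed changes the fitted prediction by exactly that predictor's coefficient. A minimal sketch on simulated data (not the course data):

```r
# Simulate a small dataset and fit a two-predictor linear model
set.seed(1)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 2 + 0.5 * d$x1 - 0.3 * d$x2 + rnorm(100)
fit <- lm(y ~ x1 + x2, data = d)

# Two prediction points that differ only by one unit of x1
low  <- data.frame(x1 = 1, x2 = 0)
high <- data.frame(x1 = 2, x2 = 0)

# The difference in predictions equals the fitted coefficient of x1
predict(fit, high) - predict(fit, low)  # equals coef(fit)["x1"]
```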

From model 1 to model 2: dropping one variable

As suggested in the instructions, I remove the variable “surf” from the model, as it is not statistically significant. The instructions were slightly ambiguous as to whether I should remove one of the non-significant variables or both; I chose to remove only one. “surf” was chosen because it has a lower correlation with “points” in absolute value and is also further from statistical significance. The new model is called “my_model2”. Interpretation follows in Part 4.

Part 4: New linear model, interpretation of coefficients and multiple R-squared

In this part, we study the new model with two explanatory variables (“attitude” and “stra”), as suggested at the end of Part 3.

# new model without "surf"
my_model2 <- lm(points ~ attitude + stra, data = students2014)
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97290    2.39591   3.745  0.00025 ***
## attitude     0.34658    0.05652   6.132 6.31e-09 ***
## stra         0.91365    0.53447   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

Interpretation of coefficients and multiple R-squared

The summary of the new model again suggests that the intercept and the variable “attitude” are statistically significant (now with p < 0.001). Now “stra” is also marginally significant (0.05 < p < 0.10, which is still often interpreted as not statistically significant). A one-unit increase in “attitude” is now associated with an increase of approximately 0.35 in exam points, and a one-unit increase in “stra” with an increase of approximately 0.91. Our model has a multiple R-squared of 0.2048, meaning that it explains 20.48% of the variation of the dependent variable (exam points) around its mean. The higher the value, the more variation the model explains.

The problem with R-squared is that it increases whenever more explanatory variables are added, even if the new variables carry no real information. For models with multiple variables, adjusted R-squared is therefore useful: it penalizes additional explanatory variables and increases only if a new variable improves the model more than would be expected by chance.
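This behaviour can be demonstrated with a short sketch on simulated data: adding a pure-noise predictor can never lower multiple R-squared, while adjusted R-squared penalizes the extra variable.

```r
# Simulate a response that depends on x only, plus an unrelated noise column
set.seed(2)
d <- data.frame(x = rnorm(50))
d$y <- 1 + 0.8 * d$x + rnorm(50)
d$noise <- rnorm(50)                  # unrelated to y by construction

s1 <- summary(lm(y ~ x, data = d))
s2 <- summary(lm(y ~ x + noise, data = d))

s2$r.squared >= s1$r.squared          # always TRUE: R-squared cannot drop
c(adj1 = s1$adj.r.squared, adj2 = s2$adj.r.squared)  # adjusted R-squared may drop instead
```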

Part 5: Graphical model validation: diagnostics of the model

This part focuses on diagnostics of the model and the analysis is based on graphical model validation based on the following plots: Residuals vs Fitted values (first graph), Normal QQ-plot (second graph) and Residuals vs Leverage (third graph).

plot(my_model2, which=c(1))

plot(my_model2, which=c(2))

plot(my_model2, which=c(5))

Assumptions of our model

Our model assumes that the relationship between our variables is linear as the dependent variable is modelled as a linear combination of model parameters. We also assume that the residuals

  • are normally distributed (with mean 0 and constant variance sigma^2)

  • are uncorrelated, with the size of a given error not depending on the explanatory variables.

Analysing the residuals allows us to study the validity of the assumptions.

Studying the assumptions

Constant variance of errors: Residuals vs Fitted values

The size of the errors should not depend on the explanatory variables. To inspect this, we look at the first graph (Residuals vs Fitted values), in which the spread appears reasonably random with no clear patterns. This suggests that the size of the errors does not depend on the explanatory variables.

Normally distributed residuals: Normal QQ-plot

A QQ-plot is used to study whether the errors are normally distributed. In the second graph we see that our residuals fall quite well on the line, even though there are some deviations in the tails. The interpretation in our case is that the assumption of normally distributed errors is fairly reasonable.

Leverage of observations: Residuals vs Leverage

Our third graph shows how much impact a single observation has on the model. As outliers in the data may strongly influence the model, we want to check whether the model is driven by a few single observations. In our case, as the graph suggests, there are no single outliers with an outsized impact on the model.
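The checks read off these plots can also be computed numerically. A sketch on a simulated model (not my_model2), using base R diagnostics:

```r
# Simulate a simple linear model to diagnose
set.seed(3)
d <- data.frame(x = rnorm(80))
d$y <- 2 + d$x + rnorm(80)
fit <- lm(y ~ x, data = d)

mean(resid(fit))           # essentially zero: OLS residuals sum to zero
shapiro.test(resid(fit))   # formal normality test to complement the QQ-plot
head(sort(cooks.distance(fit), decreasing = TRUE), 3)  # most influential observations
```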

Conclusion about the assumptions

The assumptions seem to hold fairly well, as suggested by the previous subsections.


3. Logistic Regression

This chapter consists of exercises related to data wrangling and analysis. However, the code for data wrangling is not presented here as instructed; please see it in the GitHub repository. After some housekeeping, the exercise is divided into eight parts. The contents of each part are introduced at its beginning (and the headers are quite informative too). The first part refers to creating this .Rmd file, so the analysis starts from Part 2. The topic of this chapter is logistic regression, where the dependent (target) variable is discrete.

I start with some housekeeping and by loading the packages used later in the exercise.

rm(list=ls()) 
library(dplyr)
library(ggplot2)
library(GGally)
library(tidyr)

Part 2: Importing data + brief overview

The first step is to import the joined student alcohol consumption data, after which I print the names of the variables and describe the dataset briefly.

alc <- read.table("alc.txt", sep=",") 
#names of variables
names(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
#for curiosity, dimensions and structure:
#dimensions of the data
dim(alc)
## [1] 382  35
# look at the structure of the data
str(alc)
## 'data.frame':    382 obs. of  35 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : int  2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : int  8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : int  8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...

The dataset “alc” includes 382 rows (observations) and 35 columns (variables). The types of the variables can be checked from the output of str(alc): most are factors and integers, with the exceptions of “alc_use” (num) and “high_use” (logical). The names of the variables are listed above. The dataset was imported from the working directory, as “alc.txt” was created earlier in the data wrangling part.

The original dataset is described as follows: “This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por).” (https://archive.ics.uci.edu/ml/datasets/Student+Performance)

In short, the dataset is a combination of two sets and it contains results of a questionnaire that is later used to study the relationship between alcohol consumption and school performance among other things while taking account various aspects listed above. (Please note that some of the more detailed description of variables and data is provided later in Part 4!)

Source of the data

The source of the data: https://archive.ics.uci.edu/ml/machine-learning-databases/00320/ (Paulo Cortez, University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez) (more information about the data: https://archive.ics.uci.edu/ml/datasets/Student+Performance) The dataset “alc” used in this chapter is wrangled from the source mentioned above.

Part 3: Hypotheses about relationships between various variables and high alcohol consumption

In this part I briefly speculate about four variables having a relationship with high alcohol consumption. Below I present four hypotheses. Disclaimer: the hypotheses only give some reasoning about the relationships; they are not intended to cover all the relevant mechanisms, as that was not the task here.

  • “absences”: I expect high alcohol consumption to have a positive relationship with the number of school absences, as consuming alcohol causes hangovers, which make it less pleasant to go to school. Also, choosing to consume a lot of alcohol suggests that the student’s preferences are heavily tilted toward instant gratification, as he/she is willing to sacrifice tomorrow’s utility for today’s. By the same logic, since going to school can be interpreted as sacrificing today’s utility for tomorrow’s, I would expect the alcohol-preferring student to skip school.
  • “famrel”: I expect that family relations have a negative correlation with high alcohol consumption as good family relations could mean that the parents protect the student from using alcohol and as bad relationships with family may be a driver for consuming alcohol and not being at home.
  • “G3”: I expect grades to have a negative relationship with high alcohol consumption, as high consumption is associated with weaker skills in coping with delayed gratification and causes hangovers, which make it harder to study for exams, etc.
  • “failures”: This variable is similar to “G3” and “absences” in that I expect high alcohol consumption to have a negative relationship with school performance. Thus, I expect a positive relationship between the number of past class failures and high alcohol consumption, as consumption may signal low motivation for various reasons, cause hangovers that affect performance, etc.

Part 4: Numerical and graphical exploration of distributions of variables and their relationships with alcohol consumption

I use a subset of the full dataset here to illustrate the relationships between the dependent variable and my chosen independent variables. To do so, I create a data frame called “minialc” from the original “alc”, so that I can study the variables of interest and their relationships.

Alc_use<-alc$alc_use
High_use<-alc$high_use
Absences<-alc$absences
Failures<-alc$failures
G_3<-alc$G3
Famrel<-alc$famrel
mini<-cbind(Alc_use, High_use, Absences, Failures, G_3, Famrel)
minialc<-as.data.frame(mini)
p <- ggpairs(minialc, mapping = aes(alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

Studying this correlogram reveals quite a bit. I start with the interpretation of graphical distributions.

  • Alcohol consumption: right-skewed distribution; a clear majority (more than 50%) report very low alcohol consumption.

  • High use of alcohol: logical (binary) variable (true or false), majority don’t consume high amounts of alcohol.

  • Absences - number of school absences (numeric: from 0 to 93): heavily right-skewed distribution: overwhelming majority report 0 absences.

  • Failures - number of past class failures (numeric: n if 1<=n<3, else 4): heavily right-skewed distribution: the overwhelming majority report 0 failures.

  • G_3 - final grade (numeric: from 0 to 20, output target): the “most normal” distribution among the variables in this subset, with the median roughly in the middle.

  • Famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent): a discrete Likert-scale variable, with most of the mass at 4 and very little mass at the lower values.

These distributions can also be read from the bar plots below, which are probably a bit easier to interpret, while the correlogram above also provides other useful information.

gather(minialc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free")+geom_bar()

Numerically, the correlogram shows that the directions of my hypotheses in Part 3 were correct.

However, what is surprising in my opinion is that the correlations with “Alc_use” are fairly small in absolute value. Absences clearly has the highest correlation (in absolute value) with alcohol use (0.215), while the others are considerably smaller (Failures 0.185, G_3 -0.156, Famrel -0.121).

Part 5: Statistical analysis: logistic regression

In this part, I use logistic regression to statistically explore the relationship between my chosen explanatory variables (failures, famrel, G3, absences) and the binary high/low alcohol consumption variable as the target. I present and interpret a summary of the fitted model and its coefficients, and I provide confidence intervals for them. Results are presented and compared with respect to my hypotheses stated in Part 3.

m <- glm(high_use ~ failures + absences + famrel + G3, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ failures + absences + famrel + G3, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2195  -0.7992  -0.6770   1.1782   1.9915  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.02926    0.67814   0.043 0.965585    
## failures     0.39080    0.19830   1.971 0.048752 *  
## absences     0.08174    0.02253   3.628 0.000285 ***
## famrel      -0.22402    0.12525  -1.789 0.073669 .  
## G3          -0.04377    0.03794  -1.154 0.248623    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 435.52  on 377  degrees of freedom
## AIC: 445.52
## 
## Number of Fisher Scoring iterations: 4

The summary of our model shows that among the explanatory variables, only “absences” and “failures” have a statistically significant relationship with “high_use” (p < 0.05, so the null hypothesis of a zero coefficient can be rejected). Their positive coefficients match the hypotheses, while “famrel” and “G3” are not statistically significant, although their negative coefficients also match the hypotheses.

# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR     2.5 %   97.5 %
## (Intercept) 1.0296913 0.2699340 3.891344
## failures    1.4781653 1.0018560 2.193923
## absences    1.0851775 1.0405060 1.136808
## famrel      0.7992957 0.6247181 1.022647
## G3          0.9571760 0.8883867 1.031268

Studying the odds ratios (“OR” in the table above) shows that “failures” and “absences” have ORs above 1 (1.48 and 1.09, respectively), so they are positively associated with high use of alcohol, while the ORs below 1 for “famrel” and “G3” indicate a negative association. More precisely, an OR is a multiplier on the odds: each additional past class failure multiplies the odds of high alcohol consumption by about 1.48, and each additional absence multiplies them by about 1.09. Correspondingly, a one-unit improvement in family relations multiplies the odds by about 0.80, and one additional grade point (G3) by about 0.96, i.e. both decrease the odds. NB: “famrel” and “G3” were not statistically significant at p < 0.05.
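The link between a logistic regression coefficient and its odds ratio can be verified directly: exp(coefficient) is the exact multiplier on the odds for a one-unit increase in the predictor. A sketch on simulated data (not the "alc" data):

```r
# Simulate a binary outcome driven by one predictor and fit a logistic model
set.seed(4)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(-1 + 0.7 * d$x))
fit <- glm(y ~ x, data = d, family = "binomial")

# Odds at a given x: exp of the linear predictor
odds <- function(x) exp(predict(fit, data.frame(x = x), type = "link"))

odds(1) / odds(0)     # ratio of odds for a one-unit increase in x ...
exp(coef(fit)["x"])   # ... equals exp of the fitted coefficient
```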

Part 6: Exploring the predictive power of the model

This part focuses on analysing the predictive power of our new model, from which “famrel” and “G3” have been dropped as they were not statistically significant.

# fit the model with only significant variables
m <- glm(high_use ~ failures + absences, data = alc, family = "binomial")

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# see the last ten original classes, predicted probabilities, and class predictions
select(alc, failures, absences, sex, high_use, probability, prediction) %>% tail(10)
##     failures absences sex high_use probability prediction
## 373        1        0   M    FALSE   0.2934743      FALSE
## 374        1        7   M     TRUE   0.4313255      FALSE
## 375        0        1   F    FALSE   0.2154691      FALSE
## 376        0        6   F    FALSE   0.2968841      FALSE
## 377        1        2   F    FALSE   0.3303652      FALSE
## 378        0        2   F    FALSE   0.2303651      FALSE
## 379        2        2   F    FALSE   0.4484793      FALSE
## 380        0        3   F    FALSE   0.2459680      FALSE
## 381        0        4   M     TRUE   0.2622677      FALSE
## 382        0        2   M     TRUE   0.2303651      FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   258   10
##    TRUE     99   15

Above is our confusion matrix, which shows the predictions versus the actual values. Below is a graphical visualization of both the actual values and the predictions.

# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use,col=prediction))

# define the geom as points and draw the plot
g+geom_point()

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)%>%prop.table%>%addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.67539267 0.02617801 0.70157068
##    TRUE  0.25916230 0.03926702 0.29842932
##    Sum   0.93455497 0.06544503 1.00000000

The table above presents the accuracy of our predictions. From the table we can calculate that our model is wrong in 28.5% of all cases (high use is FALSE and the model predicts TRUE, or vice versa). While that error rate is not spectacular, it is much better than guessing randomly with equal weights. However, if one predicted “FALSE” every single time, one would be right 70.2% of the time. Happily, our model (71.5% accuracy) still outperforms this guessing strategy, which shows that the method has some value.
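These rates can be recomputed directly from the confusion-matrix counts shown above:

```r
# Counts taken from the confusion matrix above (high_use in rows, prediction in columns)
cm <- matrix(c(258, 99, 10, 15), nrow = 2,
             dimnames = list(high_use = c("FALSE", "TRUE"),
                             prediction = c("FALSE", "TRUE")))

n <- sum(cm)                               # 382 observations
accuracy     <- (cm[1, 1] + cm[2, 2]) / n  # share of correct predictions
error_rate   <- 1 - accuracy               # share of wrong predictions
always_false <- sum(cm["FALSE", ]) / n     # accuracy of always guessing FALSE

round(c(accuracy = accuracy, error = error_rate, always_false = always_false), 3)
# accuracy 0.715, error 0.285, always_false 0.702
```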

Part 7: Bonus: 10-fold cross-validation

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# compute the average number of wrong predictions in the (training) data
loss_func(alc$high_use, alc$probability)
## [1] 0.2853403
# 10-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2879581

The average prediction error of our model is 0.285, which means that in the training data our model gets 28.5% of predictions wrong on average. In the 10-fold cross-validation, the average share of wrong predictions is 0.288, slightly higher than the error rate in the training data.
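To see exactly what loss_func measures, here is a tiny worked example: a prediction counts as wrong when the predicted probability lands on the wrong side of 0.5.

```r
# Same loss function as above: share of predictions on the wrong side of 0.5
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# Classes 1, 0, 1 against probabilities 0.9, 0.2, 0.4:
# only the third is wrong (|1 - 0.4| > 0.5), so the loss is 1/3
loss_func(c(1, 0, 1), c(0.9, 0.2, 0.4))
```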

Comparing this to the results of the DataCamp exercise, my model has worse test-set performance (a bigger prediction error in 10-fold cross-validation) than the model introduced in DataCamp (which had an error of about 0.26).

Next, I try to find a model with a smaller prediction error than 0.26 by using a larger model: I add the variables “famrel”, “G3”, “guardian”, “sex”, “higher”, “Pstatus”, “Medu” and “Fedu” to my original model.

# fit a larger model with additional explanatory variables
m2 <- glm(high_use ~ failures + absences + famrel + G3+ guardian + sex + higher + Pstatus + Medu + Fedu, data = alc, family = "binomial")
cv2 <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)
cv2$delta[1]
## [1] 0.2722513

With this larger model, depending on the simulation, the average share of wrong predictions in 10-fold cross-validation ranges from approximately 0.253 to 0.272, so it is not clear whether this model predicts better than the DataCamp model. In some simulations it does, and it at least performs better than my original model.

This implies that our model generalizes fairly well to an independent data set, as the average prediction error on the training data does not differ much from the estimated error in cross-validation. In other words, our predictive model does relatively well in practice (i.e. on unseen data).

Part 8: Super-Bonus: Perform cross-validation to compare the performance of different logistic regression models (= different sets of predictors).

In this section I fit different numbers of variables to my logistic model, performing the same analysis round after round and dropping one variable each time. I then compare the performance of these models in terms of 10-fold cross-validation error. First I create the new models and compute the estimated share of wrong predictions, and then I plot the results to see whether there is a trend.

m10 <- glm(high_use ~ failures + absences + famrel + G3 + guardian + sex + higher + Pstatus + Medu + Fedu, data = alc, family = "binomial")
cv10 <- cv.glm(data = alc, cost = loss_func, glmfit = m10, K = 10)
pe10 <- cv10$delta[1]
pe10
## [1] 0.2670157
m9 <- glm(high_use ~ failures + absences + famrel + G3 + guardian + sex + higher + Pstatus + Medu, data = alc, family = "binomial")
cv9 <- cv.glm(data = alc, cost = loss_func, glmfit = m9, K = 10)
pe9 <- cv9$delta[1]
pe9
## [1] 0.2748691
m8 <- glm(high_use ~ failures + absences + famrel + G3 + guardian + sex + higher + Pstatus, data = alc, family = "binomial")
cv8 <- cv.glm(data = alc, cost = loss_func, glmfit = m8, K = 10)
pe8 <- cv8$delta[1]
pe8
## [1] 0.2670157
m7 <- glm(high_use ~ failures + absences + famrel + G3 + guardian + sex + higher, data = alc, family = "binomial")
cv7 <- cv.glm(data = alc, cost = loss_func, glmfit = m7, K = 10)
pe7 <- cv7$delta[1]
pe7
## [1] 0.2722513
m6 <- glm(high_use ~ failures + absences + famrel + G3 + guardian + sex, data = alc, family = "binomial")
cv6 <- cv.glm(data = alc, cost = loss_func, glmfit = m6, K = 10)
pe6 <- cv6$delta[1]
pe6
## [1] 0.2670157
m5 <- glm(high_use ~ failures + absences + famrel + G3 + guardian, data = alc, family = "binomial")
cv5 <- cv.glm(data = alc, cost = loss_func, glmfit = m5, K = 10)
pe5 <- cv5$delta[1]
pe5
## [1] 0.3141361
m4 <- glm(high_use ~ failures + absences + famrel + G3, data = alc, family = "binomial")
cv4 <- cv.glm(data = alc, cost = loss_func, glmfit = m4, K = 10)
pe4 <- cv4$delta[1]
pe4
## [1] 0.3062827
m3 <- glm(high_use ~ failures + absences + famrel, data = alc, family = "binomial")
cv3 <- cv.glm(data = alc, cost = loss_func, glmfit = m3, K = 10)
pe3 <- cv3$delta[1]
pe3
## [1] 0.2958115
m2 <- glm(high_use ~ failures + absences, data = alc, family = "binomial")
cv2 <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)
pe2 <- cv2$delta[1]
pe2
## [1] 0.2879581
m1 <- glm(high_use ~ failures, data = alc, family = "binomial")
cv1 <- cv.glm(data = alc, cost = loss_func, glmfit = m1, K = 10)
pe1 <- cv1$delta[1]
pe1
## [1] 0.3062827
pe <- c(pe1, pe2, pe3, pe4, pe5, pe6, pe7, pe8, pe9, pe10)
plot(pe, type = "b")

It seems that the average prediction error gets smaller with a larger number of predictors (the observation at index 10 refers to the average prediction error of the model with 10 predictors, and so on). Roughly interpreted, the predictions become more accurate the more variables we fit to the model.

The trend in the graph is not monotonic, as each error comes from a single simulation; repeating the simulations and averaging the results could give a better idea of how the average prediction error behaves with different numbers of predictors.
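One way to implement the repetition suggested above is to average several cross-validation runs per model. The sketch below does this for the 10-predictor model only; it assumes the `alc` data and `loss_func` from the earlier chunks, and the number of repetitions and the seed are arbitrary choices.

```r
library(boot)
set.seed(2017)  # arbitrary seed, for reproducibility
m10 <- glm(high_use ~ failures + absences + famrel + G3 + guardian + sex +
             higher + Pstatus + Medu + Fedu, data = alc, family = "binomial")
# repeat the 10-fold cross-validation 20 times and average the errors
errors <- replicate(20, cv.glm(data = alc, cost = loss_func,
                               glmfit = m10, K = 10)$delta[1])
mean(errors)
```

The same `replicate()` call could be wrapped in a loop over the ten model formulas to get an averaged version of the plot above.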


4. Clustering and classification

In this chapter I study the “Boston” dataset from the MASS package. The chapter starts with an introduction to the dataset, while the focus is on clustering and classification, where the idea is to find clusters of data points that are in some sense closer to each other than to other data points. After clusters are found, new observations can be classified into them. As discussed later, linear discriminant analysis (LDA) is one way to find and separate clusters: LDA finds the linear combinations of the variables that best separate the target variable classes.

Part of this exercise includes data wrangling for the next chapter, which is not commented on here. For the data wrangling, see the GitHub repository.

Analysis

The chapter is divided into subsections (Part 2 - Part 7 + bonus). Each subsection begins with a brief introduction to its content.

I start with some housekeeping and loading the packages used later in the exercise.

rm(list=ls()) 
library(dplyr)
library(ggplot2)
library(GGally)
library(tidyr)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(corrplot)
## corrplot 0.84 loaded

Part 2: Importing data + brief overview

The first step here is to load the data from the MASS package, after which I print out the structure of the data and describe the dataset briefly. The “Boston” dataset is described here.

data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The dataset contains information about housing values in the suburbs of Boston, with 506 observations and 14 variables. The variables are:

  • crim: per capita crime rate by town.

  • zn: proportion of residential land zoned for lots over 25,000 sq.ft.

  • indus: proportion of non-retail business acres per town.

  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

  • nox: nitrogen oxides concentration (parts per 10 million).

  • rm: average number of rooms per dwelling.

  • age: proportion of owner-occupied units built prior to 1940.

  • dis: weighted mean of distances to five Boston employment centres.

  • rad: index of accessibility to radial highways.

  • tax: full-value property-tax rate per $10,000.

  • ptratio: pupil-teacher ratio by town.

  • black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

  • lstat: lower status of the population (percent).

  • medv: median value of owner-occupied homes in $1000s

Source: Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.

Belsley, D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley.

Part 3: Graphical overview of the data and summaries of the variables

In this subsection the dataset is introduced in a more detailed manner. The variables and their relationships are studied both graphically and numerically.

pairs(Boston)

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
cor_matrix<-cor(Boston) %>% round(2)
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b",tl.pos="d", tl.cex=0.6)

Going through all the distributions and correlations here would be tiresome. Instead, the pairs() and summary() functions give an overview of the distributions of the variables, while the correlations between the variables are plotted above with the corrplot() function. A correlation always takes a value between -1 and 1: 1 refers to a situation in which two variables are perfectly positively correlated and move hand in hand, while -1 means they move in exactly opposite directions, one increasing when the other decreases. In the plot above, the larger the correlation in absolute value, the larger the circle; red refers to a negative and blue to a positive correlation, and the darker the color, the stronger the correlation. To name a few, “tax” and “rad” have a strong positive correlation (0.91), so the full-value property-tax rate is high in areas with good access to radial highways, whereas “nox” and “dis” have a strong negative correlation (-0.77), so pollution levels are high close to the five Boston employment centres.
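The strongest correlations can also be listed programmatically instead of being read off the plot. The following sketch melts the upper triangle of the correlation matrix into a long table and sorts it by absolute value:

```r
# list the strongest pairwise correlations in the Boston data
cm <- cor(Boston)
cm[lower.tri(cm, diag = TRUE)] <- NA             # keep each pair only once
cor_long <- subset(as.data.frame(as.table(cm)), !is.na(Freq))
head(cor_long[order(-abs(cor_long$Freq)), ], 5)  # e.g. rad-tax (0.91) on top
```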

Part 4: Standardizing the data, creating a categorical variable and train and test sets

Here I standardize the data and study how this affects the variables. I also create a categorical variable from the crime rate and divide the dataset into train and test sets, with 80% of the data in the train set.

library(dplyr)
#center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
# change the object to data frame
boston_scaled<-as.data.frame(boston_scaled)
# summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime' -- I use the bins as breaks
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels=c("low", "med_low","med_high","high"))

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

As shown in the summary of the variables above, scaling changes the values of the observations: the column means are subtracted from the corresponding columns and the differences are divided by the column standard deviations. As a result, the means of the variables become zero and the values vary on a smaller scale. Notice that the original data had “crim”, but in the LDA part I use the new variable “crime”, a categorical variable (see the code above) that takes the values “low”, “med_low”, “med_high” and “high” depending on the value of the original “crim” variable; the category breaks are the quantiles of the data. Finally, the dataset is divided into train and test sets, with 80% of the data in the train set.
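To verify what scale() does, the standardization can be reproduced by hand for a single column, here “crim” (the Boston data loaded above is assumed to be available):

```r
# scale() subtracts the column mean and divides by the column standard deviation
manual_crim <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(manual_crim, as.numeric(scale(Boston)[, "crim"]))  # TRUE
```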

Part 5: Linear discriminant analysis

In this section I fit the linear discriminant analysis on the train set and use the recently created categorical crime rate “crime” as the target variable and all the other variables in the dataset as predictor variables. After that, I draw the LDA (bi)plot.

# linear discriminant analysis
str(train)
## 'data.frame':    404 obs. of  14 variables:
##  $ zn     : num  -0.487 -0.487 -0.487 -0.487 -0.487 ...
##  $ indus  : num  1.01 -1.15 1.01 1.01 1.01 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  1.254 -0.817 0.512 0.253 0.244 ...
##  $ rm     : num  -0.8136 0.0688 -0.2585 -1.0242 -0.2429 ...
##  $ age    : num  1.0098 -1.8251 0.5871 0.0719 0.3988 ...
##  $ dis    : num  -0.887 0.674 -0.842 -0.822 -0.118 ...
##  $ rad    : num  1.66 -0.637 1.66 1.66 1.66 ...
##  $ tax    : num  1.529 0.129 1.529 1.529 1.529 ...
##  $ ptratio: num  0.806 -0.719 0.806 0.806 0.806 ...
##  $ black  : num  0.414 0.203 -3.879 -3.867 0.394 ...
##  $ lstat  : num  0.624 -0.744 1.49 0.631 0.326 ...
##  $ medv   : num  -0.80817 0.00731 -0.99301 -1.17785 -0.37325 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 4 1 4 4 4 1 4 1 4 2 ...
lda.fit <- lda(crime ~., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2524752 0.2549505 0.2500000 0.2425743 
## 
## Group means:
##                   zn      indus        chas        nox         rm
## low       1.01198832 -0.8635173 -0.15653200 -0.8674458  0.4306908
## med_low  -0.09988897 -0.2833054  0.03346513 -0.5756565 -0.1885094
## med_high -0.37516530  0.1447626  0.31238879  0.3855037  0.1758338
## high     -0.48724019  1.0149946 -0.03128211  1.0507988 -0.4269049
##                 age        dis        rad        tax    ptratio
## low      -0.8821239  0.8671335 -0.6778652 -0.7098099 -0.4237050
## med_low  -0.3807479  0.3677486 -0.5503598 -0.4926129 -0.0449347
## med_high  0.3723609 -0.3651199 -0.3723877 -0.3081234 -0.3732236
## high      0.8072823 -0.8580511  1.6596029  1.5294129  0.8057784
##                black       lstat        medv
## low       0.38572158 -0.75495786  0.48955495
## med_low   0.34228307 -0.12265386 -0.02679088
## med_high  0.07355698 -0.03650536  0.23079430
## high     -0.81543974  0.82520717 -0.65949666
## 
## Coefficients of linear discriminants:
##                  LD1          LD2         LD3
## zn       0.099639563  0.810988749 -0.86663980
## indus   -0.003335838 -0.133354254  0.26457507
## chas    -0.056805058 -0.088087040  0.10445051
## nox      0.315518505 -0.788838967 -1.25349948
## rm      -0.106731494 -0.103228116 -0.25292991
## age      0.338630101 -0.241946231 -0.16293807
## dis     -0.065441356 -0.315788771  0.12849544
## rad      3.215440012  0.831884642 -0.01204781
## tax     -0.012052514  0.003513246  0.42783581
## ptratio  0.128870289  0.111311696 -0.17297031
## black   -0.191451314  0.034839926  0.12616255
## lstat    0.163594913 -0.296947260  0.35003126
## medv     0.185928780 -0.453427270 -0.14092184
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9473 0.0387 0.0140
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 2)

An accurate description of the biplot is not given here, as it is a topic of the next chapter of the IODS project. However, the LDA results and the biplot suggest that the separation happens mostly along LD1 and LD2, and the arrows indicate that “rad”, “nox” and “zn” contribute the most to the discrimination. In short, the index of accessibility to radial highways, the nitrogen oxides concentration (parts per 10 million) and the proportion of residential land zoned for lots over 25,000 sq.ft. separate the classes more than the other variables. The angles between the arrows reflect the correlations between the variables.

Part 6: Predicting classes with the LDA model

In this part the focus is on predicting the classes of the observations of test data with the LDA-model from previous part. To do this, I have earlier dropped the “crime” variable from the test dataset. Next, I observe how well the LDA model predicts the classes of the test dataset to study the performance of the LDA model.

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       13      11        1    0
##   med_low    2      17        4    0
##   med_high   1       9       15    0
##   high       0       0        1   28

The performance of the LDA model is illustrated in the table above. A quick look shows that most of the observations are on the diagonal, which suggests that the model predicts the classes of the test dataset relatively well. To be more precise, the model gets 73 of the 102 predictions right (approximately 72%) and 29 wrong (approximately 28%).

The model has little difficulty classifying the high crime rate cases correctly (28 of 29), while the other categories cause more trouble: 74% of the true “med_low” cases, 60% of “med_high” and only 52% of “low” are classified correctly, with most of the misclassified “low” observations predicted as “med_low”. Intuitively, it makes sense that the extreme “high” class is easier to classify than the observations that fall into the middle of the crime rate distribution.
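The percentages above can be computed directly from the cross tabulation; this sketch assumes `correct_classes` and `lda.pred` from the chunk above:

```r
# overall and per-class accuracy from the confusion table
conf <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(conf)) / sum(conf)   # overall share of correct predictions
diag(conf) / rowSums(conf)    # share of correct predictions per class
```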

Part 7: Clusters

In this subsection, I reload the original “Boston” dataset, scale it and calculate the distances between the observations, after which I run the k-means algorithm on the dataset. The goal is to find the optimal number of clusters and to visualize the results.

For the distances, I calculate both Euclidean and Manhattan distances, which are saved in the distance matrices “dist_eu” and “dist_man”. Notice that the distances are calculated from the scaled Boston dataset (“bostonscaled”).

# load MASS and Boston
library(MASS)
data('Boston')
bostonscaled<-scale(Boston)
# euclidean distance matrix
dist_eu <- dist(bostonscaled)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(bostonscaled, method="manhattan")

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618

The summaries of the distance matrices show that the mean Euclidean distance is 4.91 (min. 0.13, max. 14.40) while the mean Manhattan distance is 13.55 (min. 0.27, max. 48.86).
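The difference between the two metrics is easy to see on a toy pair of points: the Euclidean distance is the straight-line distance, while the Manhattan distance sums the absolute coordinate differences.

```r
p <- rbind(c(0, 0), c(3, 4))
dist(p)                        # Euclidean: sqrt(3^2 + 4^2) = 5
dist(p, method = "manhattan")  # Manhattan: |3| + |4| = 7
```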

Next step here is to run the k-means algorithm to find the optimal number of clusters.

# k-means clustering
set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares on the scaled data
twcss <- sapply(1:k_max, function(k){kmeans(bostonscaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering with two clusters
km1 <- kmeans(bostonscaled, centers = 2)

# plot the scaled Boston dataset with clusters
pairs(bostonscaled, col = km1$cluster)

The optimal number of clusters is where the total within-cluster sum of squares (total WCSS) drops sharply. The qplot shows how the total WCSS behaves as the number of clusters changes, and it suggests that the optimal number could be 2.
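The elbow can also be inspected numerically by looking at how much the total WCSS decreases when one more cluster is added; the decrease should shrink sharply after the optimal k. This assumes the `twcss` vector from the chunk above.

```r
# decrease in total WCSS when moving from k to k + 1 clusters
wcss_drop <- -diff(twcss)
names(wcss_drop) <- paste(1:9, "->", 2:10)
round(wcss_drop)
```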

Running the algorithm again with 2 clusters and plotting the results gives the graph of the scaled Boston dataset with the clusters in red and black. The plot suggests that, compared to the red cluster, the black cluster is characterized by more crime (“crim”), higher levels of pollution (“nox”) and better access to radial highways (“rad”).

Bonus: Finding the most influential linear separators for the clusters

Here I do the clustering again, but with a larger number of clusters (>2) on the scaled Boston dataset. The qplot suggests that if 2 clusters are not used, 3 could be a good choice as well: the line drops quite evenly after 2, so a larger number of clusters might also be justified. Thus, following the qplot, I set the number of clusters to 3.

library(MASS)
data('Boston')
bostonscaled<-scale(Boston)
bostonscaled<-as.data.frame(bostonscaled)


# k-means clustering
set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(bostonscaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
km <-kmeans(bostonscaled, centers = 3)

# plot the Boston dataset with clusters
pairs(bostonscaled, col = km$cluster)

The graph above illustrates the differences between the clusters. Now that the number of clusters is set to 3, I run LDA again using the clusters as target classes and including all the variables of the Boston dataset in the model.

lda.fit2 <- lda(km$cluster ~., data = bostonscaled)

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}




# plot the lda results, colored by the k-means clusters
plot(lda.fit2, dimen = 2, col = km$cluster, pch = km$cluster)
lda.arrows(lda.fit2, myscale = 3)

Fitting the LDA model again and studying the biplot suggests that the separation again happens mostly along LD1 and LD2. The most discrimination comes from the variables “rad”, “tax” and “age”, as their arrows are the longest in the graph above. In short, the index of accessibility to radial highways, the full-value property-tax rate per $10,000 and the proportion of owner-occupied units built prior to 1940 separate the clusters more than the other variables. The angles between the arrows reflect the correlations between the variables.

Super bonus: matrix product and 3D-plot

In this part, I create a matrix product, which is a projection of the data points, and draw a 3D plot where the color is defined by the crime classes and, in a second attempt, by the k-means clusters.

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=train$crime)
#tried this:plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=km$cluster)

The dimensions of the model_predictors dataset are 404 and 13. The first 3D plot is created using the plotly package, with colors assigned according to “crime” (the categorical variable created earlier). In the second 3D plot the colors should be assigned according to the clusters found with k-means. I got an error in that part, so only the first 3D plot could be created and the plots won't be compared here.
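A plausible explanation for the error is a length mismatch: `km$cluster` has 506 elements (one per row of the full Boston data), while `matrix_product` has only the 404 rows of the train set. One possible fix, sketched below, is to run k-means on the training predictors themselves so that the cluster labels line up with the projected points; the number of centers (4) is an arbitrary choice here.

```r
# cluster the training predictors so the labels match the projection rows
set.seed(123)
km_train <- kmeans(model_predictors, centers = 4)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type = 'scatter3d', mode = 'markers',
        color = factor(km_train$cluster))
```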


5. Dimensionality reduction techniques

In this chapter I study dimensionality reduction techniques. The chapter starts with an introduction to the dataset, while the focus is on principal component analysis (PCA). The chapter is divided into data wrangling and analysis.

Data wrangling and a brief introduction to the data

This section starts with an introduction to the data from the UNDP and proceeds with some wrangling and cleaning of the data. The data wrangling is included in the “create_human.R” file, which can be accessed here. The file includes comments on the steps taken, but in short I do the following:

The “human.txt” data created previously is wrangled such that the Gross National Income (GNI) variable is numeric.

The next step is to exclude unneeded variables so that only the following columns are kept: “Country”, “Edu2.FM”, “Labo.FM”, “Edu.Exp”, “Life.Exp”, “GNI”, “Mat.Mor”, “Ado.Birth” and “Parli.F”. After this, I remove all rows with missing values. Also, as the tail() function reveals, the “human” dataset includes observations that relate to regions instead of countries, namely the last 7 rows, so I remove them from the dataset.

The last data wrangling steps are to define the row names of the data by the country names and to remove the country name column. The data should now have 155 observations and 8 variables, which I check with the str() function. The new dataset is called “human_” and the corresponding file is again called “human.txt”, so the earlier version of “human.txt” is overwritten, although it can be recreated by running only the first part of the “create_human.R” file.

About the dataset

The data originates from the United Nations Development Programme (see: http://hdr.undp.org/en/content/human-development-index-hdi).

The wrangled version of the dataset consists of 155 observations of 8 variables that are:

  • Edu2.FM: the ratio of the female and male populations with secondary education in each country
  • Labo.FM: the ratio of female and male labour force participation in each country
  • Life.Exp: life expectancy at birth
  • Edu.Exp: expected years of schooling
  • GNI: gross national income
  • Mat.Mor: maternal mortality ratio
  • Ado.Birth: adolescent birth rate
  • Parli.F: the share of parliamentary seats held by women
rm(list=ls()) 
library(dplyr)
library(ggplot2)
library(GGally)
library(tidyr)
library(MASS)
library(corrplot)
human_ <- read.table("human.txt")
str(human_)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

Analysis

The chapter is divided into subsections (Part 1 - Part 5). Each subsection begins with a brief introduction to its content.

Part 1: a graphical overview of the data and summaries of the variables in the data

To get a more accurate understanding of the variables used later in this chapter, I study the correlations and distributions of the variables introduced in the previous section.

summary(human_)
##     Edu2.FM          Labo.FM          Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
ggpairs(human_)

The summaries give an idea of the distributions of the variables: for example, the variables are not scaled here, so there are huge differences in the means. This is partly because some of the variables are ratios and others are not. To point out a couple of things, the summaries show that the adolescent birth rate varies dramatically between countries: while the median is “only” 33.6, the maximum is far larger at 204.8. Also, the share of women in parliament is 20.91% on average with a maximum of 57.5%, while there is at least one country with no women in parliament.

Studying the ggpairs() plot gives a more accurate picture of the data, as it shows the pairwise scatter plots, the distributions and the correlations between the variables. In short, while the first half of the variables (Edu2.FM, Labo.FM, Life.Exp and Edu.Exp) are negatively skewed, the latter half is quite heavily positively skewed.
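The skewness impression from the plot can also be checked numerically. Below is a minimal moment-based sample skewness function in base R (no extra packages), demonstrated on synthetic skewed data rather than on the actual columns:

```r
# A simple moment-based sample skewness (base R, no extra packages)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3/2)
}

set.seed(1)
right_skewed <- rexp(1000)          # long right tail, like Mat.Mor or GNI
left_skewed  <- 10 - rexp(1000)     # long left tail, like Life.Exp

skewness(right_skewed)   # positive
skewness(left_skewed)    # negative
```

Applying the same function to the columns of human_ would quantify what the ggpairs() panels show visually.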

To visualize the correlations more accurately, let’s look at the corrplot() below.

cor(human_)
##                Edu2.FM      Labo.FM   Life.Exp     Edu.Exp         GNI
## Edu2.FM    1.000000000  0.009564039  0.5760299  0.59325156  0.43030485
## Labo.FM    0.009564039  1.000000000 -0.1400125  0.04732183 -0.02173971
## Life.Exp   0.576029853 -0.140012504  1.0000000  0.78943917  0.62666411
## Edu.Exp    0.593251562  0.047321827  0.7894392  1.00000000  0.62433940
## GNI        0.430304846 -0.021739705  0.6266641  0.62433940  1.00000000
## Mat.Mor   -0.660931770  0.240461075 -0.8571684 -0.73570257 -0.49516234
## Ado.Birth -0.529418415  0.120158862 -0.7291774 -0.70356489 -0.55656208
## Parli.F    0.078635285  0.250232608  0.1700863  0.20608156  0.08920818
##              Mat.Mor  Ado.Birth     Parli.F
## Edu2.FM   -0.6609318 -0.5294184  0.07863528
## Labo.FM    0.2404611  0.1201589  0.25023261
## Life.Exp  -0.8571684 -0.7291774  0.17008631
## Edu.Exp   -0.7357026 -0.7035649  0.20608156
## GNI       -0.4951623 -0.5565621  0.08920818
## Mat.Mor    1.0000000  0.7586615 -0.08944000
## Ado.Birth  0.7586615  1.0000000 -0.07087810
## Parli.F   -0.0894400 -0.0708781  1.00000000
res1 <- cor.mtest(human_, conf.level = .95)
cor(human_) %>% corrplot(p.mat = res1$p, method = "color", type = "upper",
         sig.level = c(.001, .01, .05), pch.cex = .9,
         insig = "label_sig", pch.col = "white", order = "AOE")

This visualization shows the correlations between variables. The interpretation is as follows: deep blue indicates a strong positive correlation and dark red indicates a strong negative correlation. The lighter the color, the weaker (closer to zero) the correlation. For example, the share of women in parliament correlates very weakly with the other variables, as does the ratio of female to male labour force participation. The strongest positive correlation is between Life.Exp and Edu.Exp (0.79), while the strongest negative correlation is between Mat.Mor and Life.Exp (-0.86).
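If one wants to pick out the extreme correlations programmatically instead of reading them off the plot, a sketch like the following works (the helper name strongest_pairs is my own, shown here on synthetic data):

```r
# Find the most extreme off-diagonal correlations in a correlation matrix
strongest_pairs <- function(cm) {
  diag(cm) <- NA   # ignore the trivial 1s on the diagonal
  pos <- which(cm == max(cm, na.rm = TRUE), arr.ind = TRUE)[1, ]
  neg <- which(cm == min(cm, na.rm = TRUE), arr.ind = TRUE)[1, ]
  list(
    strongest_positive = rownames(cm)[pos],
    strongest_negative = rownames(cm)[neg]
  )
}

# tiny synthetic example with a built-in pattern
set.seed(42)
x <- rnorm(100)
d <- data.frame(a = x, b = x + rnorm(100, sd = 0.1), c = -x + rnorm(100))
strongest_pairs(cor(d))
```

Running strongest_pairs(cor(human_)) would return the Life.Exp/Edu.Exp and Mat.Mor/Life.Exp pairs mentioned above.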

Part 2: Principal Component Analysis (PCA) on non-standardized dataset

In this section, I perform principal component analysis (PCA) on the non-standardized human data and show the variability captured by the principal components. I also draw a biplot displaying the observations along the first two principal components.

The idea of PCA is explained later in this chapter in Part 4 among the interpretations!

# perform PCA on the non-standardized data and print out a summary
pca_human_ <- prcomp(human_)
s <- summary(pca_human_)
s
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000
# rounded percentages of variance captured by each PC
pca_pr <- 100*round(s$importance[2, ], digits = 3)

# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0
# create object pc_lab to be used as axis labels
pc_lab<-paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot(pca_human_, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

As the PCA suggests, the first component captures essentially all of the variance here. Keep in mind that the data is not standardized, and PCA therefore performs very badly. This is due to the staggering differences in the scales of the variables: some vary on a very small scale in absolute terms, while others vary on a much larger scale. Here, the variable GNI, which has by far the largest variance, dominates PC1. This part illustrates the problems of using PCA without standardizing the data, so no further interpretations are provided, as they would be uninformative.
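The scale problem can be reproduced with two unrelated synthetic variables on very different scales: PC1 then aligns almost entirely with the large-scale variable, just as GNI dominates here.

```r
set.seed(123)
# two unrelated variables, one on a vastly larger scale (like GNI here)
small <- rnorm(100, mean = 0,     sd = 1)
large <- rnorm(100, mean = 20000, sd = 15000)

p <- prcomp(cbind(small, large))
summary(p)$importance["Proportion of Variance", "PC1"]
# PC1 captures essentially all variance, driven by the large-scale variable
```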

Part 3: Standardizing the data and performing PCA again + interpretations and comparisons

This section continues working with the PCA after standardizing the data. After running the PCA, I compare the findings to those from Part 2.

# standardize the variables
human_std <- scale(human_)
# print out summaries of the standardized variables
summary(human_std)
##     Edu2.FM           Labo.FM           Life.Exp          Edu.Exp       
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7188   Min.   :-2.7378  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6425   1st Qu.:-0.6782  
##  Median : 0.3503   Median : 0.2316   Median : 0.3056   Median : 0.1140  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.6717   3rd Qu.: 0.7126  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 1.4218   Max.   : 2.4730  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
# create and print out a summary of pca_human
pca_human <- prcomp(human_std)
s <- summary(pca_human)
s
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000
# rounded percentages of variance captured by each PC
pca_pr <- 100*round(s$importance[2, ], digits = 3)

# print out the percentages of variance
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
# create object pc_lab to be used as axis labels
pc_lab<-paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot(pca_human, cex = c(0.6, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

Standardizing the data before PCA makes a lot of sense. As explained in Part 2, the staggering differences in the scales of the variables cause a lot of harm unless they are standardized, because some of the variables vary on a very small scale in absolute terms while others vary on a much larger scale. PCA is sensitive to the relative scaling of the original features and treats features with larger variance as more important than features with smaller variance. This also explains why the results differ so much. In the PCA with standardized variables, the first principal component (PC1) captures 53.6% of all variation (not 100% as in the non-standardized PCA above), PC2 captures 16.2% and PC3 9.6%. The remaining principal components capture the rest (around 20% in total).
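As a side note, prcomp() can also standardize internally via its scale. argument, which should be equivalent to calling scale() first; a quick check on the built-in iris data:

```r
# prcomp(scale. = TRUE) standardizes internally; equivalent to scale() first
data(iris)
x <- iris[, 1:4]

p1 <- prcomp(scale(x))
p2 <- prcomp(x, scale. = TRUE)

# the component standard deviations agree either way
all.equal(p1$sdev, p2$sdev)
```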

In the biplot above, the scatter plot shows the observations projected onto the first two principal components, while the arrows visualize the connections between the original variables and the principal components. The arrows are interpreted as follows:

  • The angle between two arrows can be interpreted as the correlation between the variables. For example, the share of parliamentary seats held by women (Parli.F) has its strongest positive correlation with the ratio of female to male labour force participation (Labo.FM).

  • The angle between an arrow and a PC axis can be interpreted as the correlation between the variable and that component. For example, the adolescent birth rate (Ado.Birth) correlates positively with the first principal component (PC1).

  • The lengths of the arrows are proportional to the standard deviations of the variables. For example, the ratio of female to male labour force participation (Labo.FM) has a greater standard deviation than the share of parliamentary seats held by women (Parli.F).

Part 4: Interpretations of the first two principal component dimensions based on the biplot

The idea of PCA is to reduce the dimensionality of the dataset by constructing new characteristics that summarize the data in a meaningful way: the new characteristics are constructed so that they explain as much of the variance in the data as possible, with the first principal component explaining more than any other component, the second explaining less than the first but more than the third, and so on, and the PCs are orthogonal to each other. These new characteristics (the PCs) are linear combinations of the original variables in the dataset.
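Under the hood, prcomp() computes these components from a singular value decomposition of the centered data matrix; a minimal check of this relationship on the built-in iris data:

```r
data(iris)
x  <- as.matrix(iris[, 1:4])
xc <- scale(x, center = TRUE, scale = FALSE)   # center only

sv <- svd(xc)
# principal component standard deviations from the singular values
sdev_svd <- sv$d / sqrt(nrow(xc) - 1)

all.equal(sdev_svd, prcomp(x)$sdev)
```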

Studying the biplot in Part 3 and looking at the arrows of the variables with respect to the principal components, it appears that PC1 is largely constructed from Mat.Mor, Ado.Birth, Edu.Exp, Edu2.FM, Life.Exp and GNI, while PC2 is more or less defined by Parli.F and Labo.FM. This is suggested by the directions of the arrows, as explained in Part 3: the angle between a variable and a PC axis can be interpreted as the correlation between the two, and the variables strongly correlated with PC1 have a small angle with respect to the x-axis (PC1), while those strongly correlated with PC2 have a small angle with respect to the y-axis (PC2).

In other words, the variables that most heavily define PC1 (the other variables contribute as well, just less strongly) explain most of the variance in the dataset (53.6%), while the second principal component, which explains 16.2% of the variance, is dominated by Labo.FM and Parli.F. Thus, while the principal components don’t have a straightforward interpretation beyond their statistical properties, the first principal component seems to capture factors related to development on a broader scale and could be interpreted as a development PC, while the second PC is much more closely related to the role of women and could therefore be interpreted as a women PC.
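This reading of the biplot can be verified numerically from the rotation (loadings) matrix returned by prcomp(): the variables with the largest absolute loadings define each component. A sketch on synthetic data with a built-in two-block structure (the variable names dev1/wom1 etc. are made up):

```r
set.seed(7)
n  <- 200
f1 <- rnorm(n)   # "development"-like latent factor
f2 <- rnorm(n)   # "women"-like latent factor

# three variables load on f1, two on f2 (names are illustrative)
d <- data.frame(
  dev1 =  f1 + rnorm(n, sd = 0.3),
  dev2 = -f1 + rnorm(n, sd = 0.3),
  dev3 =  f1 + rnorm(n, sd = 0.3),
  wom1 =  f2 + rnorm(n, sd = 0.3),
  wom2 =  f2 + rnorm(n, sd = 0.3)
)

p <- prcomp(scale(d))
round(p$rotation[, 1:2], 2)
# the largest absolute loadings on PC1 come from dev1-dev3, on PC2 from wom1-wom2
```

Inspecting pca_human$rotation in the same way would confirm which of the human_ variables dominate PC1 and PC2.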

Part 5: Multiple Correspondence Analysis on the tea data and interpretations

This section focuses on Multiple Correspondence Analysis (MCA), using the “tea” dataset from the FactoMineR package. Like PCA, MCA is a dimensionality reduction technique, and as in PCA, the goal is to detect and represent underlying structures in a dataset. Unlike PCA, however, MCA is designed for categorical (qualitative) variables.

I start by looking at the structure and the dimensions of the data after which I visualize it.

First, after fetching the data, I choose my variables of interest by keeping only the columns “Tea”, “How”, “how”, “sugar”, “where” and “lunch”. After this, I print the summaries and visualize the dataset. The resulting dataset, tea_time, includes 300 observations of 6 variables.

library(FactoMineR)
library(tidyr)
library(dplyr)
data("tea")
# choosing the columns to keep in the dataset
# The columns are picked by indexing instead of select() (as in DataCamp),
# because MASS and dplyr both define a select() function and mask each other.
# (Alternatively, dplyr::select() would make the intended function explicit.)
tea_time <- tea[, c("Tea", "How", "how", "sugar", "where", "lunch")]
# look at the summaries and structure of the data
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# picking the columns from "tea" into "tea_time" gives the same data as using
# select(), which str() confirms.

# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") +geom_bar()+theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

The graph above gives a nice overview of the nature of the variables, which are all categorical. MCA is a good tool here, as PCA doesn’t work well with categorical variables.

Next, I execute the MCA.

# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.279   0.261   0.219   0.189   0.177   0.156
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.144   0.141   0.117   0.087   0.062
## % of var.              7.841   7.705   6.392   4.724   3.385
## Cumulative % of var.  77.794  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.003   0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            0.027   2.867 |   0.433   9.160   0.338  10.053 |
## green                0.107  -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone                0.127  -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                0.035   3.226 |   1.329  14.771   0.218   8.081 |
## milk                 0.020   2.422 |   0.013   0.003   0.000   0.116 |
## other                0.102   5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag              0.161  -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged   0.478  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged           0.141  -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |

The summary of the MCA shows the variances and the percentages of variance for the dimensions. As in PCA, the first dimension explains more of the variance than any other, the second less than the first but more than the rest, and so on.

The first dimension explains 15.2% of the variance, the second 14.2% and the third 12.0%, as shown in the summary.
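These figures can also be extracted programmatically from the MCA object via its eig element (assuming the same FactoMineR setup as above):

```r
library(FactoMineR)

data(tea)
tea_time <- tea[, c("Tea", "How", "how", "sugar", "where", "lunch")]
mca <- MCA(tea_time, graph = FALSE)

# eigenvalue, percentage of variance and cumulative percentage per dimension
head(mca$eig, 3)

# the number of dimensions is (total number of categories) - (number of variables)
nrow(mca$eig)   # 17 categories - 6 variables = 11
```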

# visualize MCA
plot(mca, invisible=c("ind"), habillage="quali")

The visualization of the MCA represents the similarity of the categories as the distance between them: categories that appear close to each other in the biplot are more similar. For example, “tea bag” and “chain store” are more similar than “tea bag” and “lemon”.

Notice that the categories belonging to one variable share the same color; for example, “tea bag”, “unpackaged” and “tea bag+unpackaged” are the categories of the “how” variable and are therefore all green in the graph.

The graph shows that the most significant outliers in terms of the first two dimensions (which explain the largest and second largest shares of the variance) are “unpackaged” and “tea shop”.